This is a fast-paced course that covers a lot of material. There will be a large number of references. You may need to do your own research to fill in the gaps between lectures and homework/projects. It is impossible to learn data science without getting your hands dirty. Please budget your time evenly. A last-minute work ethic will not work for this course.
Homework in this course differs from the usual homework assignment. Most of the time, assignments are built on real case studies. While you will be applying methods covered in lectures, you will also find that extra teaching materials appear here. The focus will always be on the goals of the study, the usefulness of the data gathered, and the limitations of any conclusions you may draw. Always try to challenge your data analysis in a critical way. Frequently, there are no unique solutions.
Case studies in each homework can be listed as your data science projects (e.g. on your CV) where you see fit.
RStudio and R Markdown
dplyr
ggplot
Homework assignments can be done in a group consisting of up to three members. Please find your group members as soon as possible and register your group on our Canvas site.
All work submitted should be completed in the R Markdown format. You can find a cheat sheet for R Markdown here. For those who have never used it before, we urge you to start this homework as soon as possible.
Submit the following files, one submission for each group: (1) Rmd file, (2) a compiled HTML or PDF version, and (3) all necessary data files if different from our source data. You may directly edit this .rmd file to add your answers. If you intend to work on the problems separately within your group, compile your answers into one Rmd file before submitting. We encourage you to at least attempt each problem by yourself before working with your teammates. Additionally, ensure that you can ‘knit’ or compile your Rmd file. You will likely also need to configure RStudio to properly convert files to PDF. These instructions might be helpful.
In general, be as concise as possible while giving a fully
complete answer to each question. All necessary datasets are available
in this homework folder on Canvas. Make sure to document your code with
comments (written on separate lines in a code chunk using a hashtag
# before the comment) so the teaching fellows can follow
along. R Markdown is particularly useful because it follows a ‘stream of
consciousness’ approach: as you write code in a code chunk, make sure to
explain what you are doing outside of the chunk.
A few good or solicited submissions will be used as sample solutions. When those are released, make sure to compare your answers and understand the solutions.
How successful is the Wharton talk show Business Radio Powered by the Wharton School?
Background: Have you ever listened to SiriusXM? Do you know there is a talk show run by Wharton professors on Sirius Radio? Wharton launched a talk show called Business Radio Powered by the Wharton School through the Sirius Radio station in January of 2014. Within a short period of time the general reaction seemed to be overwhelmingly positive. To find out the audience size for the show, we designed a survey and collected a data set via MTURK in May of 2014. Our goal was to estimate the audience size. There were 51.6 million Sirius Radio listeners then. One approach is to estimate the proportion of Sirius listeners who are also Wharton listeners, \(p\), so that the audience size estimate is approximately 51.6 million times \(p\).
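The arithmetic behind this approach can be sketched as follows. The respondent counts are hypothetical placeholders, not the survey results; only the 51.6 million figure comes from the text.

```r
# Hypothetical counts, for illustration only; these are NOT the survey results.
n_sirius  <- 1000   # respondents who said they have listened to Sirius Radio
n_wharton <- 40     # of those, respondents who have also listened to the Wharton show

sirius_listeners <- 51.6e6         # total Sirius listeners (from the text)
p_hat <- n_wharton / n_sirius      # estimated share of Sirius listeners who are Wharton listeners
audience_hat <- sirius_listeners * p_hat

p_hat
audience_hat
```

The estimate is only as good as the assumption that the sample's Sirius listeners resemble Sirius listeners in general.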
To do so, we launched a survey via Amazon Mechanical Turk (MTurk) on May 24, 2014 at an offered price of $0.10 for each answered survey. We set it to run for 6 days with a target maximum sample size of 2000. Most of the observations came in within the first two days. The main questions of interest are “Have you ever listened to Sirius Radio” and “Have you ever listened to Sirius Business Radio by Wharton?”. A few demographic features used as control variables were also collected; these include Gender, Age and Household Income.
We requested that only people in the United States answer the questions. Each person can only fill in the questionnaire once to avoid duplicates. Aside from these restrictions, we opened the survey to everyone on MTurk in the hope that the sample would be more randomly chosen.
The raw data is stored as Survey_results_final.csv on
Canvas.
Select only the variables Age, Gender, Education Level, Household Income in 2013, Sirius Listener?, Wharton Listener? and Time used to finish the survey.
Reading in the data
Change the variable names to be “age”, “gender”, “education”, “income”, “sirius”, “wharton”, “worktime”.
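A minimal base R sketch of the select-and-rename step is shown below. It uses a toy data frame in place of Survey_results_final.csv, and the column headers (including the extra `WorkerId` column) are assumptions; the real file's headers may be spelled differently.

```r
# Toy stand-in for Survey_results_final.csv; the real column headers are assumed.
raw <- data.frame(
  "Age" = c("25", "31"),
  "Gender" = c("Male", "Female"),
  "Education Level" = c("Bachelor", "select one"),
  "Household Income in 2013" = c("$30,000 - $50,000", ""),
  "Sirius Listener?" = c("Yes", "No"),
  "Wharton Listener?" = c("No", "No"),
  "Time used to finish the survey" = c(22, 35),
  "WorkerId" = c("w1", "w2"),   # hypothetical extra column to be dropped
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only the seven variables of interest, in a fixed order.
keep <- c("Age", "Gender", "Education Level", "Household Income in 2013",
          "Sirius Listener?", "Wharton Listener?", "Time used to finish the survey")
survey <- raw[, keep]

# Replace the long headers with short names.
names(survey) <- c("age", "gender", "education", "income",
                   "sirius", "wharton", "worktime")
str(survey)
```

With the real file, `read.csv("Survey_results_final.csv", check.names = FALSE)` would take the place of the toy data frame, assuming the file sits in the working directory.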
As with most real-world data based on user input, the data is incomplete, with missing values, and has incorrect responses. There is no general rule for dealing with these problems beyond “use common sense.” In every case, explain what the problems were and how you addressed them. Be sure to explain your rationale for your chosen methods of handling issues with the data. Do not use Excel for this, however tempting it might be.
Tip: Reflect on the reasons for which data could be wrong or missing. How would you address each case? For this homework, if you are trying to predict missing values with regression, you are definitely overthinking. Keep it simple.
How many blanks in
Before this point, I thought there were just blanks and NAs in the
data, but discovered there were entire fields left unchanged in the form of
“select one” for at least the education column. I went back to tabulate
these entries with summarise(). The table shows
41 total counts while the filter shows 34 rows because some people have
multiple variables missing.
This shows there aren’t any NAs in the data, but there are 22 “blanks” or missing values. There are also unanswered questions where people left the default “select one” option unchanged.
Income could be missing because people did not have an income in the indicated year. Education could be missing if people have little education or did not know what to report.
Altogether there are 34 entries which match this condition and will be removed.
## Warning: NAs introduced by coercion
Within the age variable, 3 people did not respond, 2
people selected ages that did not make sense in the context of the
survey (4 and 223), and one person wrote their age as a character
string, “eighteen (18)”. Since these were likely errors on the part of the
participant, we decided to fully remove their surveys.
Within the gender variable, 6 people did not
respond. We decided to fully remove their surveys.
Within the education variable, 19 selected ‘select
one’. We decided to keep their surveys but change the result to
“other”.
Within the income variable, 6 people did not
respond. We decided to fully remove their surveys.
Within the sirius variable, 5 people did not
respond. We decided to fully remove their surveys.
Within the wharton variable, 4 people did not
respond. We decided to fully remove their surveys.
We did not remove any values from the worktime
variable.
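The cleaning decisions above could be sketched roughly as follows. This is a base R illustration on made-up rows, not the code used for the report; the 18 to 100 age window is an assumed plausibility range.

```r
# Toy data mimicking the problems described above; all values are made up.
survey <- data.frame(
  age       = c("25", "", "4", "223", "eighteen (18)", "40"),
  gender    = c("Male", "Female", "Male", "", "Female", "Female"),
  education = c("Bachelor", "select one", "High school",
                "Bachelor", "Bachelor", "select one"),
  stringsAsFactors = FALSE
)

# Non-numeric ages become NA; this coercion is what triggers the
# "NAs introduced by coercion" warning, suppressed here on purpose.
survey$age <- suppressWarnings(as.numeric(survey$age))

# Keep rows with a plausible age (assumed range) and a stated gender.
ok <- !is.na(survey$age) & survey$age >= 18 & survey$age <= 100 &
  survey$gender != ""
clean <- survey[ok, ]

# Recode the unchanged "select one" placeholder instead of dropping the row.
clean$education[clean$education == "select one"] <- "other"
clean
```

The same pattern extends to the income, sirius and wharton columns by adding further conditions to `ok`.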
Write a brief report to summarize all the variables collected. Include both summary statistics (including sample size) and graphical displays such as histograms or bar charts where appropriate. Comment on what you have found from this sample. (For example, it’s very interesting to think about why one would work for a job that pays only 10 cents per survey. Who are those survey workers? The answer may be interesting even if it does not directly relate to our goal.)
Received 1740 surveys.
- age
  - right skewed, younger participants
  - youngest was 18 and the oldest was 76
  - most people were in the early to late twenties
- gender
  - 42% female, 58% male
  - there are 729 females and 997 males in the survey pool
- education
Income
worktime
Now we have an age range of 18-76
We can see the new distribution of ages with a histogram.
## Warning in geom_vline(aes(xintercept = mean(age)), color = "red", linetype =
## "dashed", : Ignoring unknown parameters: `sixe`
There are 729 Females and 997 Males in the survey pool
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
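The plotting warnings above point to two likely fixes: a misspelt line-thickness argument and the deprecated `..density..` notation. A hedged sketch of an age histogram that avoids both is given below; the ages are simulated for illustration and are not the survey data.

```r
library(ggplot2)

# Simulated right-skewed ages for illustration only; not the survey data.
set.seed(1)
ages <- data.frame(age = pmin(76, 18 + round(rexp(500, rate = 1 / 12))))

p <- ggplot(ages, aes(x = age)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey70", color = "white") +
  # linewidth sets line thickness in ggplot2 >= 3.4.0; a misspelt
  # argument name is ignored with a warning
  geom_vline(xintercept = mean(ages$age), color = "red",
             linetype = "dashed", linewidth = 1) +
  labs(x = "Age", y = "Density")
p
```

Computing the mean outside `aes()` keeps the reference line a single constant rather than a mapped aesthetic.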
The population from which the sample is drawn determines where the results of our analysis can be applied or generalized. We include some basic demographic information for the purpose of identifying sample bias, if any exists. Combine our data and the general population distribution in age, gender and income to try to characterize our sample on hand.
Does this sample appear to be a random sample from the general population of the USA?
Does this sample appear to be a random sample from the MTURK population?
Note: You cannot provide evidence by simply looking at our data here. For example, you need to find the distribution of education in our age group in the US to see if the two groups match in distribution. You may need to gather some background information about the MTURK population to get a rough sense of whether this particular sample seems to be a random sample from it. Please do not spend too much time gathering evidence.
Source: the Census Bureau’s Annual Estimates of the Resident Population by Sex, Race, and Hispanic Origin for the United States: April 1, 2010 to July 1, 2019.
## Warning in geom_text(aes(label = scales::percent(pct), y = pct, stat =
## "count", : Ignoring unknown aesthetics: stat
## New names:
## New names:
## • `` -> `...1`
Give a final estimate of the Wharton audience size in January 2014. Assume that the sample is a random sample of the MTURK population, and that the proportion of Wharton listeners vs. Sirius listeners in the general population is the same as that in the MTURK population. Write a brief executive summary to summarize your findings and how you came to that conclusion.
The final estimate is that there are 2,585,789 Wharton listeners.
To be specific, you should include:
Now suppose you are asked to design a study to estimate the audience size of Wharton Business Radio Show as of today: You are given a budget of $1000. You need to present your findings in two months.
Please fill in the Google form to list the platform where your surveys will be launched and collected HERE.
A good proposal will give an accurate estimation with the least amount of money used.
Are women underrepresented in science in general? How does gender
relate to the type of educational degree pursued? Does the number of
higher degrees increase over the years? In an attempt to answer these
questions, we assembled a data set (WomenData_06_16.xlsx)
from NSF
about various degrees granted in the U.S. from 2006 to 2016. It contains
the following variables: Field (Non-science-engineering
(Non-S&E) and sciences (Computer sciences,
Mathematics and statistics, etc.)), Degree
(BS, MS, PhD), Sex
(M, F), Number of degrees granted, and
Year.
Our goal is to answer the above questions only through EDA (Exploratory Data Analysis) without formal testing. We have provided sample R code in the appendix to help you if needed.
Notice the data came in as an Excel file. We need to use the package
readxl and the function read_excel() to read
the data WomenData_06_16.xlsx into R.
Field, Degree, Sex, Year and Number
We can count the number of NA values, if there are
any.
There are no missing values
5 fields: Field, Degree, Sex, Year, Number
We can count the amount of unique entries in the Field variable
There are 10 Fields possible
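The two checks above (missing values per column and distinct entries in Field) can be sketched in base R as follows. The data frame `wsci` is a made-up stand-in for the contents of WomenData_06_16.xlsx, so the counts it produces are not the ones reported here.

```r
# Toy stand-in for the degrees data; all values are made up.
wsci <- data.frame(
  Field  = c("Non-S&E", "Computer sciences",
             "Mathematics and statistics", "Non-S&E"),
  Degree = c("BS", "MS", "PhD", "BS"),
  Sex    = c("F", "M", "F", "M"),
  Year   = c(2006, 2006, 2016, 2016),
  Number = c(100, 20, 5, 90)
)

na_per_column <- colSums(is.na(wsci))    # missing values in each column
n_fields <- length(unique(wsci$Field))   # distinct entries in Field

na_per_column
n_fields
```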
BS MS PhD
11
Is there evidence that more males are in science-related fields vs
Non-S&E? Provide summary statistics and a plot which
shows the number of people by gender and by field. Write a brief summary
to describe your findings.
## `summarise()` has grouped output by 'SE'. You can override using the `.groups`
## argument.
Is there evidence that more males are in science-related fields vs
Non-S&E? –> There are more people overall in
Non-S&E fields, and more of them are women. However, these plots
show more men than women in the S&E fields.
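A summary of this kind could be computed along the following lines. The grouping variable `SE` matches the name in the `summarise()` message above, but how it was actually built is an assumption, and the counts are made up.

```r
library(dplyr)

# Toy counts for illustration only; not the NSF data.
wsci <- data.frame(
  Field  = c("Non-S&E", "Non-S&E", "Computer sciences", "Computer sciences"),
  Sex    = c("F", "M", "F", "M"),
  Number = c(120, 80, 10, 40)
)

by_se_sex <- wsci %>%
  # Assumed definition: everything that is not "Non-S&E" counts as S&E.
  mutate(SE = ifelse(Field == "Non-S&E", "Non-S&E", "S&E")) %>%
  group_by(SE, Sex) %>%
  # .groups = "drop" silences the "grouped output" message
  summarise(total = sum(Number), .groups = "drop") %>%
  group_by(SE) %>%
  mutate(share = total / sum(total)) %>%   # share of each sex within S&E / Non-S&E
  ungroup()

by_se_sex
```

Reporting shares alongside totals makes it easier to compare S&E and Non-S&E, since the two groups differ a lot in size.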
Describe the number of people by type of degree, field, and gender. Do you see any evidence of gender effects over different types of degrees? Again, provide graphs to summarize your findings.
Answer: The gender proportions in Non-S&E fields are relatively similar across
degree types, but there is more variability in the S&E degrees. Males
and females obtain about the same number of science BS degrees (females slightly
more), but males have more science MS and PhD degrees.
In this last portion of the EDA, we ask you to provide evidence numerically and graphically: Does the number of degrees change by gender, field, and time?
Does the number of degrees change by gender, field, and time? –>
Females appear to have more degrees overall at the BS and MS level, but
males have more PhDs in STEM fields than females. Males also have
more STEM MS degrees than females.
Finally, is there evidence showing that women are underrepresented in data science? Data science is an interdisciplinary field of computer science, math, and statistics. You may include year and/or degree.
Overall, we believe there is enough graphical evidence that women are underrepresented in data science related fields. In computer science, that disparity is only getting worse with time. There are fewer people in math and statistics at the BS and MS level; however, even with more overall degrees in math at the PhD level, women still get about half as many math PhDs as men.
Summarize your findings, focusing on whether we see consistent patterns that more males pursue science-related fields. Any concerns with the data set? How could we improve on the study?
We would like to explore how payroll affects performance among Major League Baseball teams. The data is prepared in two formats, recording payroll and number/percentage of wins by team from 1998 to 2014.
Here are the datasets:
-MLPayData_Total.csv: wide format
-baseball.csv: long format
Feel free to use either dataset to address the problems.
Payroll may relate to performance among ML Baseball teams. One possible argument is that what affects this year’s performance is not this year’s payroll, but the amount that payroll increased from last year. Let us look into this through EDA.
Create increment in payroll
a). To describe the increment of payroll in each year there are several possible approaches. Take 2013 as an example:
- option 1: diff: payroll_2013 - payroll_2012
- option 2: log diff: log(payroll_2013) - log(payroll_2012)
Explain why the log difference is more appropriate in this setup.
b). Create a new variable
diff_log=log(payroll_2013) - log(payroll_2012). Hint: use
dplyr::lag() function.
c). Create a long data table including: team, year, diff_log, win_pct
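Steps b) and c) can be sketched with `dplyr::lag()` as follows. The toy table and its column names (`team`, `year`, `payroll`, `win_pct`) are assumptions about the long format; the real baseball.csv may be laid out differently.

```r
library(dplyr)

# Hypothetical long-format toy data; the real baseball.csv columns may differ.
payroll_long <- data.frame(
  team    = rep(c("A", "B"), each = 3),
  year    = rep(2011:2013, times = 2),
  payroll = c(100, 110, 121, 50, 50, 100),
  win_pct = c(0.50, 0.52, 0.55, 0.45, 0.44, 0.51)
)

payroll_diff <- payroll_long %>%
  group_by(team) %>%
  arrange(year, .by_group = TRUE) %>%   # lag() relies on rows being in year order
  # log(payroll_t) - log(payroll_{t-1}) within each team; NA for the first year
  mutate(diff_log = log(payroll) - log(dplyr::lag(payroll))) %>%
  ungroup() %>%
  select(team, year, diff_log, win_pct)

payroll_diff
```

In the toy data, team A grows by 10% in both years and gets the same `diff_log` each time, even though the dollar differences (10 and 11) are not equal. This is one way to see that the log difference tracks relative change.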
a). Which five teams had highest increase in their payroll between years 2010 and 2014, inclusive?
b). Between 2010 and 2014, inclusive, which team(s) “improved” the most? That is, had the biggest percentage gain in wins?
Log increases in payroll have a very weak linear relationship to performance overall.
Is there evidence to support the hypothesis that higher increases in payroll on the log scale lead to increased performance? Pick a few statistics, accompanied by some data visualization, to support your answer. –> All R^2 values are less than 0.5, indicating a weak linear relationship between teams' log increase in payroll and their performance.
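An R^2 of the kind quoted above can be obtained from a simple linear regression. The sketch below uses made-up numbers and assumed column names (`diff_log`, `win_pct`), so its value says nothing about the baseball data.

```r
# Toy data for illustration only; not the MLB data.
d <- data.frame(
  diff_log = c(-0.10, 0.00, 0.05, 0.10, 0.20, 0.30),
  win_pct  = c(0.48, 0.52, 0.47, 0.55, 0.50, 0.53)
)

fit <- lm(win_pct ~ diff_log, data = d)   # simple linear regression
r2  <- summary(fit)$r.squared             # share of variance in win_pct explained
r   <- cor(d$diff_log, d$win_pct)         # Pearson correlation

r2
r^2   # with a single predictor, R^2 equals the squared correlation
```

A low R^2 indicates a weak linear association; on its own it does not establish or rule out a causal effect of payroll.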
Which set of factors better explains performance: yearly payroll or yearly increase in payroll? What criterion is being used?
This linear model shows that raw payroll predicts performance more significantly than log payroll, although the R^2 overall is still quite low.